- 
                Notifications
    
You must be signed in to change notification settings  - Fork 15.1k
 
[flang][runtime] Speed up initialization & destruction #148087
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7597a02    to
    b9be690      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you, Peter!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // address all elements. It genernalizes contiguity by also allowing | |
| // address all elements. It generalizes contiguity by also allowing | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I no longer always see this sort of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mark such methods with force-inline attribute. I am not suggesting doing it in this PR, but I wonder if you tried it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't try it, and I've run into other trouble because something somewhere has a link-time unresolved external reference now to Descriptor::Elements(). Updating soon to leave Elements out-of-line with a new inline InlineElements() that it can call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no clue how this change could have caused a linking issue..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't track it down either. The unresolved reference is not within flang_rt.runtime or the generated code for the failing test.
        
          
                flang-rt/lib/runtime/derived.cpp
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the chunk size grows, I guess, we can hit cache conflicts depending on the aliasing implementation. Do you know if there is any non-temporal memcpy implementation that we can try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should use
char *to{rawInstance + *stride};
char *from{rawInstance};
for (std::size_t bytes{...}; bytes--; ) {
  *to++ = *from++;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about that.  It will probably be the same as memcpy if the compiler vectorizes it.  I think memcpy may do better with rep movs on x86.  Let's keep it simple with memcpy.  If the cache issue pops up anywhere I will experiment with https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memcpy is not defined for overlapping regions, unfortunately.
Rework derived type initialization in the runtime to just initialize the first element of any array, and then memcpy it to the others, rather than exercising the per-component paths for each element. Reword derived type destruction in the runtime to detect and exploit a fast path for allocatable components whose types themselves don't need nested destruction. Small tweaks were made in hot paths exposed by profiling in descriptor operations and derived type assignment.
Rework derived type initialization in the runtime to just initialize the first element of any array, and then memcpy it to the others, rather than exercising the per-component paths for each element.
Reword derived type destruction in the runtime to detect and exploit a fast path for allocatable components whose types themselves don't need nested destruction.
Small tweaks were made in hot paths exposed by profiling in descriptor operations and derived type assignment.